{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# General Structured Output Models with Shogun Machine Learning Toolbox"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Shell Hu (GitHub ID: [hushell](https://github.com/hushell))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Thanks Patrick Pletscher and Fernando J. Iglesias García for taking time to help me finish the project! Shoguners = awesome! Me = grateful!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook illustrates the training of a <a href=\"http://en.wikipedia.org/wiki/Factor_graph\">factor graph</a> model using  <a href=\"http://en.wikipedia.org/wiki/Structured_support_vector_machine\">structured SVM</a> in Shogun. We begin by giving a brief outline of factor graphs and <a href=\"http://en.wikipedia.org/wiki/Structured_prediction\">structured output learning</a> followed by the corresponding API in Shogun. Finally, we test the scalability by performing an experiment on a real <a href=\"http://en.wikipedia.org/wiki/Optical_character_recognition\">OCR</a> data set for <a href=\"http://en.wikipedia.org/wiki/Handwriting_recognition\">handwritten character recognition</a>."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Factor Graph"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A factor graph explicitly represents the factorization of an undirected graphical model in terms of a set of factors (potentials), each of which is defined on a clique in the original graph [1]. For example, a MRF distribution can be factorized as \n",
    "\n",
    "$$\n",
    "P(\\mathbf{y}) = \\frac{1}{Z} \\prod_{F \\in \\mathcal{F}} \\theta_F(\\mathbf{y}_F),\n",
    "$$\n",
    "\n",
    "where $F$ is the factor index, $\\theta_F(\\mathbf{y}_F)$ is the energy with respect to assignment $\\mathbf{y}_F$. In this demo, we focus only on table representation of factors. Namely, each factor holds an energy table $\\theta_F$, which can be viewed as an unnormalized CPD. According to different factorizations, there are different types of factors. Usually we assume the Markovian property is held, that is, factors have the same parameterization if they belong to the same type, no matter how location or time changes. In addition, we have parameter free factor type, but nothing to learn for such kinds of types. More detailed implementation will be explained later."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Structured Prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Structured prediction typically involves an input $\\mathbf{x}$ (can be structured) and a structured output $\\mathbf{y}$. A joint feature map $\\Phi(\\mathbf{x},\\mathbf{y})$ is defined to incorporate structure information into the labels, such as chains, trees or general graphs. In general, the linear parameterization will be used to give the prediction rule. We leave the kernelized version for future work.\n",
    "\n",
    "$$\n",
    "\\hat{\\mathbf{y}} = \\underset{\\mathbf{y} \\in \\mathcal{Y}}{\\operatorname{argmax}} \\langle \\mathbf{w}, \\Phi(\\mathbf{x},\\mathbf{y}) \\rangle \n",
    "$$\n",
    "\n",
    "where $\\Phi(\\mathbf{x},\\mathbf{y})$ is the feature vector by mapping local factor features to corresponding locations in terms of $\\mathbf{y}$, and $\\mathbf{w}$ is the global parameter vector. In factor graph model, parameters are associated with a set of factor types. So $\\mathbf{w}$ is a collection of local parameters. \n",
    "\n",
    "The parameters are learned by regularized risk minimization, where the risk defined by user provided loss function $\\Delta(\\mathbf{y},\\mathbf{\\hat{y}})$ is usually non-convex and non-differentiable, e.g. the Hamming loss. So the empirical risk is defined in terms of the surrogate hinge loss $H_i(\\mathbf{w}) = \\max_{\\mathbf{y} \\in \\mathcal{Y}} \\Delta(\\mathbf{y}_i,\\mathbf{y}) - \\langle \\mathbf{w}, \\Psi_i(\\mathbf{y}) \\rangle $, which is an upper bound of the user defined loss. Here $\\Psi_i(\\mathbf{y}) = \\Phi(\\mathbf{x}_i,\\mathbf{y}_i) - \\Phi(\\mathbf{x}_i,\\mathbf{y})$. The training objective is given by\n",
    "\n",
    "$$\n",
    "\\min_{\\mathbf{w}} \\frac{\\lambda}{2} ||\\mathbf{w}||^2 + \\frac{1}{N} \\sum_{i=1}^N H_i(\\mathbf{w}). \n",
    "$$\n",
    "\n",
    "In Shogun's factor graph model, the corresponding implemented functions are:\n",
    "\n",
    "- <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CStructuredModel.html#a15bd99e15bbf0daa8a727d03dbbf4bcd\">FactorGraphModel::get_joint_feature_vector()</a> $\\longleftrightarrow \\Phi(\\mathbf{x}_i,\\mathbf{y})$ \n",
    "\n",
    "- <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CFactorGraphModel.html#a36665cfdd7ea2dfcc9b3c590947fe67f\">FactorGraphModel::argmax()</a> $\\longleftrightarrow H_i(\\mathbf{w})$\n",
    "\n",
    "- <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CFactorGraphModel.html#a17dac99e933f447db92482a6dce8489b\">FactorGraphModel::delta_loss()</a> $\\longleftrightarrow \\Delta(\\mathbf{y}_i,\\mathbf{y})$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Experiment: OCR"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Show Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First of all, we load the OCR data from a prepared mat file. The raw data can be downloaded from <a href=\"http://www.seas.upenn.edu/~taskar/ocr/\">http://www.seas.upenn.edu/~taskar/ocr/</a>. It has 6876 handwritten words with an average length of 8 letters from 150 different persons. Each letter is rasterized into a binary image of size 16 by 8 pixels. Thus, each $\\mathbf{y}$ is a chain, and each node has 26 possible states denoting ${a,\\cdots,z}$. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%pylab inline\n",
    "%matplotlib inline\n",
    "import numpy as np\n",
    "import scipy.io"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "dataset = scipy.io.loadmat('../../../data/ocr/ocr_taskar.mat')\n",
    "# patterns for training\n",
    "p_tr = dataset['patterns_train']\n",
    "# patterns for testing\n",
    "p_ts = dataset['patterns_test']\n",
    "# labels for training\n",
    "l_tr = dataset['labels_train']\n",
    "# labels for testing\n",
    "l_ts = dataset['labels_test']\n",
    "\n",
    "# feature dimension\n",
    "n_dims = p_tr[0,0].shape[0]\n",
    "# number of states\n",
    "n_stats = 26\n",
    "# number of training samples\n",
    "n_tr_samples = p_tr.shape[1]\n",
    "# number of testing samples\n",
    "n_ts_samples = p_ts.shape[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Few examples of the handwritten words are shown below. Note that the first capitalized letter has been removed. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "def show_word(patterns, index):\n",
    "    \"\"\"show a word with padding\"\"\"\n",
    "    plt.rc('image', cmap='binary')\n",
    "    letters = patterns[0,index][:128,:]\n",
    "    n_letters = letters.shape[1]\n",
    "    for l in xrange(n_letters):\n",
    "        lett = np.transpose(np.reshape(letters[:,l], (8,16)))\n",
    "        lett = np.hstack((np.zeros((16,1)), lett, np.zeros((16,1))))\n",
    "        lett = np.vstack((np.zeros((1,10)), lett, np.zeros((1,10))))\n",
    "        subplot(1,n_letters,l+1)\n",
    "        imshow(lett)\n",
    "        plt.xticks(())\n",
    "        plt.yticks(())\n",
    "    plt.tight_layout()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "show_word(p_tr, 174)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "show_word(p_tr, 471)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "show_word(p_tr, 57)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Define Factor Types and Build Factor Graphs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's define 4 factor types, such that a word will be able to be modeled as a chain graph.\n",
    "\n",
    "- The unary factor type will be used to define unary potentials that capture the appearance likelihoods of each letter. In our case, each letter has $16 \\times 8$ pixels, thus there are $(16 \\times 8 + 1) \\times 26$ parameters. Here the additional bits in the parameter vector are bias terms. One for each state. \n",
    "- The pairwise factor type will be used to define pairwise potentials between each pair of letters. This type in fact gives the Potts potentials. There are $26 \\times 26$ parameters. \n",
    "- The bias factor type for the first letter is a compensation factor type, since the interaction is one-sided. So there are $26$ parameters to be learned.\n",
    "- The bias factor type for the last letter, which has the same intuition as the last item. There are also $26$ parameters.\n",
    "\n",
    "Putting all parameters together, the global parameter vector $\\mathbf{w}$ has length $4082$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from modshogun import TableFactorType\n",
    "\n",
    "# unary, type_id = 0\n",
    "cards_u = np.array([n_stats], np.int32)\n",
    "w_gt_u = np.zeros(n_stats*n_dims)\n",
    "fac_type_u = TableFactorType(0, cards_u, w_gt_u)\n",
    "\n",
    "# pairwise, type_id = 1\n",
    "cards = np.array([n_stats,n_stats], np.int32)\n",
    "w_gt = np.zeros(n_stats*n_stats)\n",
    "fac_type = TableFactorType(1, cards, w_gt)\n",
    "\n",
    "# first bias, type_id = 2\n",
    "cards_s = np.array([n_stats], np.int32)\n",
    "w_gt_s = np.zeros(n_stats)\n",
    "fac_type_s = TableFactorType(2, cards_s, w_gt_s)\n",
    "\n",
    "# last bias, type_id = 3\n",
    "cards_t = np.array([n_stats], np.int32)\n",
    "w_gt_t = np.zeros(n_stats)\n",
    "fac_type_t = TableFactorType(3, cards_t, w_gt_t)\n",
    "\n",
    "# all initial parameters\n",
    "w_all = [w_gt_u,w_gt,w_gt_s,w_gt_t]\n",
    "\n",
    "# all factor types\n",
    "ftype_all = [fac_type_u,fac_type,fac_type_s,fac_type_t]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we write a function to construct the factor graphs and prepare labels for training. For each factor graph instance, the structure is a chain but the number of nodes and edges depend on the number of letters, where unary factors will be added for each letter, pairwise factors will be added for each pair of neighboring letters. Besides, the first and last letter will get an additional bias factor respectively.   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def prepare_data(x, y, ftype, num_samples):\n",
    "    \"\"\"prepare FactorGraphFeatures and FactorGraphLabels \"\"\"\n",
    "    from modshogun import Factor, TableFactorType, FactorGraph\n",
    "    from modshogun import FactorGraphObservation, FactorGraphLabels, FactorGraphFeatures\n",
    "\n",
    "    samples = FactorGraphFeatures(num_samples)\n",
    "    labels = FactorGraphLabels(num_samples)\n",
    "\n",
    "    for i in xrange(num_samples):\n",
    "        n_vars = x[0,i].shape[1]\n",
    "        data = x[0,i].astype(np.float64)\n",
    "\n",
    "        vc = np.array([n_stats]*n_vars, np.int32)\n",
    "        fg = FactorGraph(vc)\n",
    "\n",
    "        # add unary factors\n",
    "        for v in xrange(n_vars):\n",
    "            datau = data[:,v]\n",
    "            vindu = np.array([v], np.int32)\n",
    "            facu = Factor(ftype[0], vindu, datau)\n",
    "            fg.add_factor(facu)\n",
    "\n",
    "        # add pairwise factors\n",
    "        for e in xrange(n_vars-1):\n",
    "            datap = np.array([1.0])\n",
    "            vindp = np.array([e,e+1], np.int32)\n",
    "            facp = Factor(ftype[1], vindp, datap)\n",
    "            fg.add_factor(facp)\n",
    "\n",
    "        # add bias factor to first letter\n",
    "        datas = np.array([1.0])\n",
    "        vinds = np.array([0], np.int32)\n",
    "        facs = Factor(ftype[2], vinds, datas)\n",
    "        fg.add_factor(facs)\n",
    "\n",
    "        # add bias factor to last letter\n",
    "        datat = np.array([1.0])\n",
    "        vindt = np.array([n_vars-1], np.int32)\n",
    "        fact = Factor(ftype[3], vindt, datat)\n",
    "        fg.add_factor(fact)\n",
    "\n",
    "        # add factor graph\n",
    "        samples.add_sample(fg)\n",
    "\n",
    "        # add corresponding label\n",
    "        states_gt = y[0,i].astype(np.int32)\n",
    "        states_gt = states_gt[0,:]; # mat to vector\n",
    "        loss_weights = np.array([1.0/n_vars]*n_vars)\n",
    "        fg_obs = FactorGraphObservation(states_gt, loss_weights)\n",
    "        labels.add_label(fg_obs)\n",
    "\n",
    "    return samples, labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# prepare training pairs (factor graph, node states)\n",
    "n_tr_samples = 350 # choose a subset of training data to avoid time out on buildbot\n",
    "samples, labels = prepare_data(p_tr, l_tr, ftype_all, n_tr_samples)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An example of graph structure is visualized as below, from which you may have a better sense how a factor graph being built. Note that different colors are used to represent different factor types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "try:\n",
    "    import networkx as nx # pip install networkx\n",
    "except ImportError:\n",
    "    import pip\n",
    "    pip.main(['install', '--user', 'networkx'])\n",
    "    import networkx as nx\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# create a graph\n",
    "G = nx.Graph()\n",
    "node_pos = {}\n",
    "\n",
    "# add variable nodes, assuming there are 3 letters\n",
    "G.add_nodes_from(['v0','v1','v2'])\n",
    "for i in xrange(3):\n",
    "    node_pos['v%d' % i] = (2*i,1)\n",
    "\n",
    "# add factor nodes\n",
    "G.add_nodes_from(['F0','F1','F2','F01','F12','Fs','Ft'])\n",
    "for i in xrange(3):\n",
    "    node_pos['F%d' % i] = (2*i,1.006)\n",
    "    \n",
    "for i in xrange(2):\n",
    "    node_pos['F%d%d' % (i,i+1)] = (2*i+1,1)\n",
    "    \n",
    "node_pos['Fs'] = (-1,1)\n",
    "node_pos['Ft'] = (5,1)\n",
    "\n",
    "# add edges to connect variable nodes and factor nodes\n",
    "G.add_edges_from([('v%d' % i,'F%d' % i) for i in xrange(3)])\n",
    "G.add_edges_from([('v%d' % i,'F%d%d' % (i,i+1)) for i in xrange(2)])\n",
    "G.add_edges_from([('v%d' % (i+1),'F%d%d' % (i,i+1)) for i in xrange(2)])\n",
    "G.add_edges_from([('v0','Fs'),('v2','Ft')])\n",
    "\n",
    "# draw graph\n",
    "fig, ax = plt.subplots(figsize=(6,2))\n",
    "nx.draw_networkx_nodes(G,node_pos,nodelist=['v0','v1','v2'],node_color='white',node_size=700,ax=ax)\n",
    "nx.draw_networkx_nodes(G,node_pos,nodelist=['F0','F1','F2'],node_color='yellow',node_shape='s',node_size=300,ax=ax)\n",
    "nx.draw_networkx_nodes(G,node_pos,nodelist=['F01','F12'],node_color='blue',node_shape='s',node_size=300,ax=ax)\n",
    "nx.draw_networkx_nodes(G,node_pos,nodelist=['Fs'],node_color='green',node_shape='s',node_size=300,ax=ax)\n",
    "nx.draw_networkx_nodes(G,node_pos,nodelist=['Ft'],node_color='purple',node_shape='s',node_size=300,ax=ax)\n",
    "nx.draw_networkx_edges(G,node_pos,alpha=0.7)\n",
    "plt.axis('off')\n",
    "plt.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can create the factor graph model and start training. We will use the tree max-product belief propagation to do MAP inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from modshogun import FactorGraphModel, TREE_MAX_PROD\n",
    "\n",
    "# create model and register factor types\n",
    "model = FactorGraphModel(samples, labels, TREE_MAX_PROD)\n",
    "model.add_factor_type(ftype_all[0])\n",
    "model.add_factor_type(ftype_all[1])\n",
    "model.add_factor_type(ftype_all[2])\n",
    "model.add_factor_type(ftype_all[3])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In Shogun, we implemented several batch solvers and online solvers. Let's first try to train the model using a batch solver. We choose the dual bundle method solver (<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CDualLibQPBMSOSVM.html\">DualLibQPBMSOSVM</a>) [2], since in practice it is slightly faster than the primal n-slack cutting plane solver (<a a href=\"http://www.shogun-toolbox.org/doc/en/latest/PrimalMosekSOSVM_8h.html\">PrimalMosekSOSVM</a>) [3]. However, it still will take a while until convergence. Briefly, in each iteration, a gradually tighter piece-wise linear lower bound of the objective function will be constructed by adding more cutting planes (most violated constraints), then the approximate QP will be solved. Finding a cutting plane involves calling the max oracle $H_i(\\mathbf{w})$ and in average $N$ calls are required in an iteration. This is basically why the training is time consuming. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from modshogun import DualLibQPBMSOSVM\n",
    "from modshogun import BmrmStatistics\n",
    "import pickle\n",
    "import time\n",
    "\n",
    "# create bundle method SOSVM, there are few variants can be chosen\n",
    "# BMRM, Proximal Point BMRM, Proximal Point P-BMRM, NCBM\n",
    "# usually the default one i.e. BMRM is good enough\n",
    "# lambda is set to 1e-2\n",
    "bmrm = DualLibQPBMSOSVM(model, labels, 0.01)\n",
    "\n",
    "bmrm.set_TolAbs(20.0)\n",
    "bmrm.set_verbose(True)\n",
    "bmrm.set_store_train_info(True)\n",
    "    \n",
    "# train\n",
    "t0 = time.time()\n",
    "bmrm.train()\n",
    "t1 = time.time()\n",
    "\n",
    "w_bmrm = bmrm.get_w()\n",
    "\n",
    "print \"BMRM took\", t1 - t0, \"seconds.\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's check the duality gap to see if the training has converged. We aim at minimizing the primal problem while maximizing the dual problem. By the weak duality theorem, the optimal value of the primal problem is always greater than or equal to dual problem. Thus, we could expect the duality gap will decrease during the time. A relative small and stable duality gap may indicate the convergence. In fact, the gap doesn't have to become zero, since we know it is not far away from the local minima."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,4))\n",
    "\n",
    "primal_bmrm = bmrm.get_helper().get_primal_values()\n",
    "dual_bmrm = bmrm.get_result().get_hist_Fd_vector()\n",
    "\n",
    "len_iter = min(primal_bmrm.size, dual_bmrm.size)\n",
    "primal_bmrm = primal_bmrm[1:len_iter]\n",
    "dual_bmrm = dual_bmrm[1:len_iter]\n",
    "\n",
    "# plot duality gaps\n",
    "xs = range(dual_bmrm.size)\n",
    "axes[0].plot(xs, (primal_bmrm-dual_bmrm), label='duality gap')\n",
    "axes[0].set_xlabel('iteration')\n",
    "axes[0].set_ylabel('duality gap')\n",
    "axes[0].legend(loc=1)\n",
    "axes[0].set_title('duality gaps');\n",
    "axes[0].grid(True)\n",
    "\n",
    "# plot primal and dual values\n",
    "xs = range(dual_bmrm.size-1)\n",
    "axes[1].plot(xs, primal_bmrm[1:], label='primal')\n",
    "axes[1].plot(xs, dual_bmrm[1:], label='dual')\n",
    "axes[1].set_xlabel('iteration')\n",
    "axes[1].set_ylabel('objective')\n",
    "axes[1].legend(loc=1)\n",
    "axes[1].set_title('primal vs dual');\n",
    "axes[1].grid(True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are other statitics may also be helpful to check if the solution is good or not, such as the number of cutting planes, from which we may have a sense how tight the piece-wise lower bound is. In general, the number of cutting planes should be much less than the dimension of the parameter vector."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# statistics\n",
    "bmrm_stats = bmrm.get_result()\n",
    "nCP = bmrm_stats.nCP\n",
    "nzA = bmrm_stats.nzA\n",
    "\n",
    "print 'number of cutting planes: %d' % nCP\n",
    "print 'number of active cutting planes: %d' % nzA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In our case, we have 101 active cutting planes, which is much less than 4082, i.e. the number of parameters. We could expect a good model by looking at these statistics. Now come to the online solvers. Unlike the cutting plane algorithms re-optimizes over all the previously added dual variables, an online solver will update the solution based on a single point. This difference results in a faster convergence rate, i.e. less oracle calls, please refer to Table 1 in [4] for more detail. Here, we use the stochastic subgradient descent (<a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CStochasticSOSVM.html\">StochasticSOSVM</a>) to compare with the BMRM algorithm shown before."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from modshogun import StochasticSOSVM\n",
    "\n",
    "# the 3rd parameter is do_weighted_averaging, by turning this on, \n",
    "# a possibly faster convergence rate may be achieved.\n",
    "# the 4th parameter controls outputs of verbose training information\n",
    "sgd = StochasticSOSVM(model, labels, True, True)\n",
    "\n",
    "sgd.set_num_iter(100)\n",
    "sgd.set_lambda(0.01)\n",
    "    \n",
    "# train\n",
    "t0 = time.time()\n",
    "sgd.train()\n",
    "t1 = time.time()\n",
    "    \n",
    "w_sgd = sgd.get_w()\n",
    "    \n",
    "print \"SGD took\", t1 - t0, \"seconds.\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We compare the SGD and BMRM in terms of the primal objectives versus effective passes. We first plot the training progress (until both algorithms converge) and then zoom in to check the first 100 passes. In order to make a fair comparison, we set the regularization constant to 1e-2 for both algorithms. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,4))\n",
    "\n",
    "primal_sgd = sgd.get_helper().get_primal_values()\n",
    "\n",
    "xs = range(dual_bmrm.size-1)\n",
    "axes[0].plot(xs, primal_bmrm[1:], label='BMRM')\n",
    "axes[0].plot(range(99), primal_sgd[1:100], label='SGD')\n",
    "axes[0].set_xlabel('effecitve passes')\n",
    "axes[0].set_ylabel('primal objective')\n",
    "axes[0].set_title('whole training progress')\n",
    "axes[0].legend(loc=1)\n",
    "axes[0].grid(True)\n",
    "\n",
    "axes[1].plot(range(99), primal_bmrm[1:100], label='BMRM')\n",
    "axes[1].plot(range(99), primal_sgd[1:100], label='SGD')\n",
    "axes[1].set_xlabel('effecitve passes')\n",
    "axes[1].set_ylabel('primal objective')\n",
    "axes[1].set_title('first 100 effective passes')\n",
    "axes[1].legend(loc=1)\n",
    "axes[1].grid(True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As is shown above, the SGD solver uses less oracle calls to get to converge. Note that the timing is 2 times slower than they actually need, since there are additional computations of primal objective and training error in each pass. The training errors of both algorithms for each pass are shown in below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,4))\n",
    "\n",
    "terr_bmrm = bmrm.get_helper().get_train_errors()\n",
    "terr_sgd = sgd.get_helper().get_train_errors()\n",
    "\n",
    "xs = range(terr_bmrm.size-1)\n",
    "axes[0].plot(xs, terr_bmrm[1:], label='BMRM')\n",
    "axes[0].plot(range(99), terr_sgd[1:100], label='SGD')\n",
    "axes[0].set_xlabel('effecitve passes')\n",
    "axes[0].set_ylabel('training error')\n",
    "axes[0].set_title('whole training progress')\n",
    "axes[0].legend(loc=1)\n",
    "axes[0].grid(True)\n",
    "\n",
    "axes[1].plot(range(99), terr_bmrm[1:100], label='BMRM')\n",
    "axes[1].plot(range(99), terr_sgd[1:100], label='SGD')\n",
    "axes[1].set_xlabel('effecitve passes')\n",
    "axes[1].set_ylabel('training error')\n",
    "axes[1].set_title('first 100 effective passes')\n",
    "axes[1].legend(loc=1)\n",
    "axes[1].grid(True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Interestingly, the training errors of SGD solver are lower than BMRM's in first 100 passes, but in the end the BMRM solver obtains a better training performance. A probable explanation is that BMRM uses very limited number of cutting planes at beginning, which form a poor approximation of the objective function. As the number of cutting planes increasing, we got a tighter piecewise lower bound, thus improve the performance. In addition, we would like to show the pairwise weights, which may learn important co-occurrances of letters. The hinton diagram is a wonderful tool for visualizing 2D data, in which positive and negative values are represented by white and black squares, respectively, and the size of each square represents the magnitude of each value. In our case, a smaller number i.e. a large black square indicates the two letters tend to coincide.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def hinton(matrix, max_weight=None, ax=None):\n",
    "    \"\"\"Draw Hinton diagram for visualizing a weight matrix.\"\"\"\n",
    "    ax = ax if ax is not None else plt.gca()\n",
    "\n",
    "    if not max_weight:\n",
    "        max_weight = 2**np.ceil(np.log(np.abs(matrix).max())/np.log(2))\n",
    "\n",
    "    ax.patch.set_facecolor('gray')\n",
    "    ax.set_aspect('equal', 'box')\n",
    "    ax.xaxis.set_major_locator(plt.NullLocator())\n",
    "    ax.yaxis.set_major_locator(plt.NullLocator())\n",
    "\n",
    "    for (x,y),w in np.ndenumerate(matrix):\n",
    "        color = 'white' if w > 0 else 'black'\n",
    "        size = np.sqrt(np.abs(w))\n",
    "        rect = plt.Rectangle([x - size / 2, y - size / 2], size, size,\n",
    "                             facecolor=color, edgecolor=color)\n",
    "        ax.add_patch(rect)\n",
    "\n",
    "    ax.autoscale_view()\n",
    "    ax.invert_yaxis()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# get pairwise parameters, also accessible from\n",
    "# w[n_dims*n_stats:n_dims*n_stats+n_stats*n_stats]\n",
    "model.w_to_fparams(w_sgd) # update factor parameters\n",
    "w_p = ftype_all[1].get_w()\n",
    "w_p = np.reshape(w_p,(n_stats,n_stats))\n",
    "hinton(w_p)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Inference"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we show how to do inference with the learned model parameters for a given data point."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# get testing data\n",
    "samples_ts, labels_ts = prepare_data(p_ts, l_ts, ftype_all, n_ts_samples)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from modshogun import FactorGraphFeatures, FactorGraphObservation, TREE_MAX_PROD, MAPInference\n",
    "\n",
    "# get a factor graph instance from test data\n",
    "fg0 = samples_ts.get_sample(100)\n",
    "fg0.compute_energies()\n",
    "fg0.connect_components()\n",
    "\n",
    "# create a MAP inference using tree max-product\n",
    "infer_met = MAPInference(fg0, TREE_MAX_PROD)\n",
    "infer_met.inference()\n",
    "\n",
    "# get inference results\n",
    "y_pred = infer_met.get_structured_outputs()\n",
    "y_truth = FactorGraphObservation.obtain_from_generic(labels_ts.get_label(100))\n",
    "print y_pred.get_data()\n",
    "print y_truth.get_data()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the end, we check average training error and average testing error. The evaluation can be done by two methods. We can either use the apply() function in the structured output machine or use the <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CSOSVMHelper.html\">SOSVMHelper</a>."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from modshogun import LabelsFactory, SOSVMHelper\n",
    "\n",
    "# training error of BMRM method\n",
    "bmrm.set_w(w_bmrm)\n",
    "model.w_to_fparams(w_bmrm)\n",
    "lbs_bmrm = bmrm.apply()\n",
    "acc_loss = 0.0\n",
    "ave_loss = 0.0\n",
    "for i in xrange(n_tr_samples):\n",
    "\ty_pred = lbs_bmrm.get_label(i)\n",
    "\ty_truth = labels.get_label(i)\n",
    "\tacc_loss = acc_loss + model.delta_loss(y_truth, y_pred)\n",
    "\n",
    "ave_loss = acc_loss / n_tr_samples\n",
    "print('BMRM: Average training error is %.4f' % ave_loss)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# training error of stochastic method\n",
    "print('SGD: Average training error is %.4f' % SOSVMHelper.average_loss(w_sgd, model))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# testing error\n",
    "bmrm.set_features(samples_ts)\n",
    "bmrm.set_labels(labels_ts)\n",
    "\n",
    "lbs_bmrm_ts = bmrm.apply()\n",
    "acc_loss = 0.0\n",
    "ave_loss_ts = 0.0\n",
    "\n",
    "for i in xrange(n_ts_samples):\n",
    "\ty_pred = lbs_bmrm_ts.get_label(i)\n",
    "\ty_truth = labels_ts.get_label(i)\n",
    "\tacc_loss = acc_loss + model.delta_loss(y_truth, y_pred)\n",
    "\n",
    "ave_loss_ts = acc_loss / n_ts_samples\n",
    "print('BMRM: Average testing error is %.4f' % ave_loss_ts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# testing error of stochastic method\n",
    "print('SGD: Average testing error is %.4f' % SOSVMHelper.average_loss(sgd.get_w(), model))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[1] Kschischang, F. R., B. J. Frey, and H.-A. Loeliger, Factor graphs and the sum-product algorithm, IEEE Transactions on Information Theory 2001.\n",
    "\n",
    "[2] Teo, C.H., Vishwanathan, S.V.N, Smola, A. and Quoc, V.Le, Bundle Methods for Regularized Risk Minimization, JMLR 2010.\n",
    "\n",
    "[3] Tsochantaridis, I., Hofmann, T., Joachims, T., Altun, Y., Support Vector Machine Learning for Interdependent and Structured Ouput Spaces, ICML 2004.\n",
    "\n",
    "[4] Lacoste-Julien, S., Jaggi, M., Schmidt, M., Pletscher, P., Block-Coordinate Frank-Wolfe Optimization for Structural SVMs, ICML 2013."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
